========================================================
This report explores a data set containing 113,937 Prosper loans. Prosper specializes is peer-to-peer (P2P) lending, allowing everyday people to both acquire and invest in loans for amounts ranging from $2,000 to $35,000. Prosper provides data to both potential borrowers and investors about their loans, with borrower data focused on the rate they will get on the loan, and investor data focuesed on the estimated return on investment.
The data set has 81 variables associated with each loan, such as: loan amount, borrower rate (i.e., interest rate), loan status, borrower income, borrower employment status, borrower credit history, estimated return for investors, etc. A description of each attribute in the loan dataset is provided here.
I will approach this analysis from the viewpoint of a potential investor. The data set provides information on both outstanding and closed outloans, so hopefully we can gain some insight on actual versus estimated return on investment. Do prosper loans live up to the hype?
Let’s start by setting up our workbook with the necessary packages, and loading in the dataset to get a sense of the variables available.
Before diving into univariate, bivariate and multivariate analysis. Let’s get an idea of how Propser has done over the time period of the provided data. Is the Prosper service growing, stagnant, or decreasing over time?
From the plot above, we see a drop in loan creation starting roughly around the 3rd quarter of 2008, lasting through quarter 2 in 2009… what’s going on here?
Some quick research via google and wikipedia states that in 2008 the SEC found Prosper to be in violation of the Securities Act of 1933 and forced Prosper to stop issuing loans. After some some rounds of litigation, and restructing of the underlying securities (now SEC regulated), Prosper reopened their P2P marketplace to continue issuing loans to borrowers and providing investment opportunities to investors.
For consistency, I will only analyze the data post SEC relaunch (July 2009 onwards) as it seems the data set itself underwent some structural changes (see variable descriptions) during the litigation phase.
Since mid 2009 we see a significant increase in issued loans via Prosper Marketplace. We also see what looks to be an increase in volatility in the number of weekly loans created. I wonder what is driving this volatility?
After getting some background infromation on Prosper and a brief look at their history and growth, let’s begin with some univariate analysis.
As an investor, and in a sense a lender (via Prosper), I am interested in learning about who borrows money using Prosper. I don’t think its unreasonable to assume that most borrowers are likely using Prosper as an alternative to traditional banks (perhaps due to a lack of eligibilty?), and therefore may pose a risky investment. As a result, I’ll start with their credit score of borrowers as a baseline.
I have included a figure for showing the corresponding FICO categories above as
Based on the ranges, it appears most borrowers have a “Fair” credit score. As expected, there are less borrowers with higher credit scores (right skewed distribution).
It would also be good to get a sense of how many credit lines borrowers have, as we may want to avoid borrowers with a lot of liabilities.
Most borrowers on prosper have around 8 lines of credit. Some borrowers have as much as 40 lines of credit! Unsurprisingly, the distribution is right skewed.
Next, I’ll get an idea of the incomes, employement status, and debt-to-income ratios for borrowers.
| Verifiable Income | n |
|---|---|
| False | 7333 |
| True | 77520 |
| Employement Status | n |
|---|---|
| Employed | 67310 |
| Full-time | 7927 |
| Not available | 0 |
| Not employed | 649 |
| Other | 3806 |
| Part-time | 256 |
| Retired | 367 |
| Self-employed | 4538 |
| Employement Status | Proportion |
|---|---|
| Employed | 0.7932542 |
| Full-time | 0.0934204 |
| Not available | 0.0000000 |
| Not employed | 0.0076485 |
| Other | 0.0448540 |
| Part-time | 0.0030170 |
| Retired | 0.0043251 |
| Self-employed | 0.0534807 |
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.150 0.220 0.259 0.320 10.010 7296
Note: Only the loans where a borrower had a verifiable income are shown in the above plots (91% of all loans). Also, incomes greater than $300k are not shown as the tail becomes imperceptibly thin beyond this value. As expected, the distribution of incomes is right skewed.
Most borrowers have a income of around $50,000 and a debt to income ratio of 0.2. Most borrowers are also employed (79%).
It might also be helpful to see the status of the loans issued using Prosper since 2009 to get an idea of how they are performing in a general sense. Are a large number of borrowers overdue on their loan payments? Have they defaulted? How many loans have been written off? How many were succesfully completed?
| Loan Status | n |
|---|---|
| Cancelled | 0 |
| Chargedoff | 5336 |
| Completed | 19664 |
| Current | 56576 |
| Defaulted | 1005 |
| FinalPaymentInProgress | 205 |
| Past Due (>120 days) | 16 |
| Past Due (1-15 days) | 806 |
| Past Due (16-30 days) | 265 |
| Past Due (31-60 days) | 363 |
| Past Due (61-90 days) | 313 |
| Past Due (91-120 days) | 304 |
| Loan Term | n |
|---|---|
| 12 | 1613 |
| 36 | 58695 |
| 60 | 24545 |
Between July 2009 and the last update of this data (03/11/2014), 19664 Prosper loans had been completed, 56576 loans were currently issued, 997 loans were in default (extended period in which borrower has not paid), 5326 had been charged off (money owed, but creditor has given up on collecting = BAD DEBT), 2067 loans were past due, and 205 were in final processing.
Prosper also provides some additional proprietary parameters (Prosper Scores & Prosper Ratings) of loan quality to investors. Prosper Scores help estimate the the probability of the loan going bad (going 60+ days past due in first year of loan), whereas the Prosper Rating appears to be a more overall description of loan risk. Remember, both are proprietary parameters provided by Prosper to potential investors. I’ll take a look at them quickly.
Next, I will take a look at the rates most borrowers get and the estimated returns for investors.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.18270 0.07408 0.09170 0.09607 0.11660 0.28370
It looks like most borrowers get a rate around 0.15 to 0.2 (median 0.1875), but we can see that a substational amount of borrowers get rates around 0.26. This distrubution is definitely not normal. Shockingly, we see that the most common rate for borrowers is 0.32 (32%)!
Estimated returns appear to follow a much more normal distrubution with a median estimated return of 0.09211 (9%). A very few number of loans are actually predicted to provide negative returns from the outset. Steer clear of these!
Although the estimated return for a loan is certainly information I would like when trying to choose one loan over the other. I think it is worth taking a look at actual loan performance. Unfortuantely, Prosper does not make this information available so we need to do a bit of work to tease this out of the data set.
I will use the following formula to calculate the Actual Return also known as the cumalative return, \(R_c\) (Source).
\[R_c = (LP_{CustomerPrincipalPayments} + LP_{InterestandFees} + LP_{CollectionFees} - P)/ P\]
Where P is the loan principal (or the LoanOriginalAmount)
Obviously, we can only assess the loans that are completed (i.e., have a closing date) for this information to be a true measure of actual loan returns.
Descriptive Statistics for loans that lose money:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.07314 0.14853 0.19139 0.27484 1.27108
Descriptive Statistics for loans that lose money:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.335543 -0.870506 -0.721707 -0.671794 -0.515325 -0.000001
Now this is some useful information! We can see that most loans have positive returns, which is expected otherwise Prosper would not be in business long! However, there are definitely loans that lose money for investors as well. Curiously, when a loan turns out to be a losing bet it more often than not tends to lose BIG. Both the positive and negative return loans are right skewed in their resepctive distributions (red dashed line at a return rate of 0.0).
The original structure of the data set is 113937 row (one for each loan) by 83 variables per loan.
Some cleaning was done to allow for easy handling of loan date information (convert to POSIXct). As stated earlier, only loans after 2009 are used in this analysis to ensure structural consistency of the data set.
In addition some variables were added to the data set which now contains records for 84,853 loans (rows) with 82 variables for each loan (columns).
A 83rd variable was also created, but only for the completed loans, which is stored in a separate data frame called loan_completed.
The main features of the dataset include: - Loan date/timing information (e.g., start date, end date) - Borrower information (e.g., credit score, income, etc.) - Loan quality informmation (e.g., Prosper Score, Prosper Rating) - Estimated loan perfromance information for investors (e.g., estimated returns)
I think the geogrpahic information will be interesting to investigate against the other variables. Do loans from certain geographies perform better than others? Perhaps a similar analysis could be done using loan listing categories or the borrowers occupation - are certain careers associated with more reliable borrowers?
Two variables were created to aid in this analysis:
These variables will help later during the bivariate and multivariate analysis. Hopefully, they can be used to some light on the type of loans that result in higher returns, and evaluate how well Prosper’s rating and scoring parameter represent true loan perfromance.
First, looking at weekly loans created was surprising as this shed some light on the growth of the Propser P2P marketplace and forced me to further investigate Prosper’s business model. Knowing that the Propser marketplace is regulated by the SEC does provide some more confididence that there is some oversight into the business as a whole. Though this does not mean you can’t lose money!.
Borrower rates showed an unusual distribution. Though partially right skewed, there are some clear spikes at higher rates. This suggests that Prosoper thinks most of it’s borrowers are a higher investment risk. Strangely, we don’t see a similar distibution in the Prosper Rating and Score of loans.
I was also surpised by the distribution of the negtaive ActualReturns (laons that lost money) were right skewed, which suggests that when borrowers stop making loan payments they tend to do so early on in the loan cycle, which leads to a higher loss as a percentage of the original loan value and investment by investors.
With some idea of the financial situation of borrowers using Prosper, let’s take a look at where most of the loans are originating from (by state) and what the loans are being used for.
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\clu73519\Repos\Udacity\Exploratory_Data_Analysis\p6_explore_and_summarize_data", layer: "cb_2017_us_state_500k"
## with 56 features
## It has 9 fields
## Integer64 fields read as strings: ALAND AWATER
## STATEFP STATENS AFFGEOID GEOID STUSPS NAME LSAD
## 0 54 01779805 0400000US54 54 WV West Virginia 00
## 1 17 01779784 0400000US17 17 IL Illinois 00
## 2 24 01714934 0400000US24 24 MD Maryland 00
## 3 16 01779783 0400000US16 16 ID Idaho 00
## 4 50 01779802 0400000US50 50 VT Vermont 00
## 5 09 01779780 0400000US09 09 CT Connecticut 00
## ALAND AWATER
## 0 62265662566 489840834
## 1 143784114293 6211277447
## 2 25150696145 6980371026
## 3 214048160737 2393355752
## 4 23873457570 1031134839
## 5 12542619303 1815495323
## STATEFP STATENS AFFGEOID GEOID STUSPS
## 01 : 1 00068085: 1 0400000US01: 1 01 : 1 AK : 1
## 02 : 1 00294478: 1 0400000US02: 1 02 : 1 AL : 1
## 04 : 1 00448508: 1 0400000US04: 1 04 : 1 AR : 1
## 05 : 1 00481813: 1 0400000US05: 1 05 : 1 AS : 1
## 06 : 1 00606926: 1 0400000US06: 1 06 : 1 AZ : 1
## 08 : 1 00662849: 1 0400000US08: 1 08 : 1 CA : 1
## (Other):50 (Other) :50 (Other) :50 (Other):50 (Other):50
## NAME LSAD ALAND AWATER
## Alabama : 1 00:56 102254240128: 1 10264451012 : 1
## Alaska : 1 102266092821: 1 1027245114 : 1
## American Samoa: 1 105833282399: 1 1028643155 : 1
## Arizona : 1 106798015774: 1 1031134839 : 1
## Arkansas : 1 111904803121: 1 103998746281: 1
## California : 1 115881477379: 1 1081293682 : 1
## (Other) :50 (Other) :50 (Other) :50
Note: Much of the plotting technique for the choropleth plot of the U.S. states was adapted from this example (Source: The R Graph Gallery).
We can see on the above interactive chloropleth plot that Califronia has the most loans of any state in the data set, with 10,761 loans. We can also see that 5 states in particular stand out from the rest (>4000 lonas):
California may stand out above the rest because of the heavy tech presence where awareness of Prosper is likely the greatest. I would guess that the other states have higher loan counts for a similar reason, as they major cities that likely have established tech scenes or strong brand awareness. An altrenative reason could be that these states simply have more citizens who are looking for alternative financing, or larger populations in general.
Earlier we looked at distributions for different types pf borrower information, let’s now take a look at how some of these variables are related. As I am a potential investor, I will stick to looking at only the data for completed loans, as this gives the most complete picture of loan performance relative to the variables shown by Prosper when a loan is first available to investors.
Since there are 83 variables, I will need to trim down the number of variables I will look at to ensure this analysis doesn’t get too long! Let’s start with the following list:
I’ll start by seeing if certain loan Term lengths (12, 36, 60 months) result in more positive outcomes for investors (Completed vs. Defaulted or Charged off).
| Term | LoanStatus | n_loans | freq |
|---|---|---|---|
| 12 | Chargedoff | 72 | 0.0470281 |
| 12 | Completed | 1449 | 0.9464402 |
| 12 | Defaulted | 10 | 0.0065317 |
| 36 | Chargedoff | 4178 | 0.2010781 |
| 36 | Completed | 15791 | 0.7599865 |
| 36 | Defaulted | 809 | 0.0389354 |
| 60 | Chargedoff | 1086 | 0.2938312 |
| 60 | Completed | 2424 | 0.6558442 |
| 60 | Defaulted | 186 | 0.0503247 |
We see that based on the historical data (remember, we are only reviewing completed loans) that the 12 month Term loans result in less defaults or charge-offs, I wonder what the distribution of Estimated and Actual Returns are for loans of different term lengths.
| Term | mean_ActualReturn | median_ActualReturn |
|---|---|---|
| 12 | 0.0336337 | 0.0581120 |
| 36 | 0.0161184 | 0.1225870 |
| 60 | -0.1683072 | 0.0510413 |
The first thing I see is that the distribution for the Estimated and Actual returns are noticeabley different when split up by Term length. The estimated returns tend to be much less extreme in general (not as many big losses or excellent returns) when compared to the distrtibutions of the Actual returns.
Looking at the Actual returns more closely, we can see that all loan term lengths have both postive and negative Actual Returns (not the case for the estimated distributions), but we see that the 12 month terms have a “tighter” distribution, with not as extreme losses or extreme gains (+ve return) either. 36 month loan terms have the widest distribution with the largest range in Actual Returns, with 60 month terms falling in between.
Another interesting observation when looking at the distributions is that if if I were to only use the estimated distributions above in selecting loans, I would tend towards 60 month loans as they’re estimated (remember, this is Prosper’s estimate!) to have limited downside with reasonable returns on average. But, after looking at the distributions for Actual returns, I would actually feel better selecting 12 month loans as they have historically resulted in less losses when compared to 36 and 60 month loans.
Further, looking at the summary table of mean and median Actual Returns by term, we see that on average 36 month terms have the highest return. But after looking at the distributions, we also know that they have the largest range in Actual Returns, with some 36-month loans posting huge losses or providing big returns for investors. I think it is fair to treat the width of the distribution of actual returns for each loan term as a rough metric of the loan’s risk profile. The phrase “big risk, big reward” seems to ring true with the 36-month loans.
Interestingly, loans with 12 month terms tend out perform loans with 60 month terms (on average), but with a noticeably tighter distributions. This means that based on the historical loan data, we might estimate that we can get similar average returns from the 12 month loans as the 60 month loans (actually slightly higher based on this data), but with less risk for catostrophic losses. This comparison of central tendencies between the 12-month and 60-month loans does not capture the fact that 60-month loans have at times resulted in higher Actual Returns than the 12-month loan, again due to the wider distribution of returns.
Based on the above discussion, it seems there are some fairly large discrepancies bewteen Estimatedand Actual returns. Knowing that Prosper generates these estimates internally, I wonder if they have immproved their ability to more accurately represent Actaul loan perfromance with their Estimated returns as they have evolved as a company and lending platform over time?
Based on the GAM fit (learn more about GAM fitting here) (blue line), it looks like the difference between Estimated and Actual returns for loans has varied over time. Between 2009 and 2011 we see that, in general, Actual returns often exceeded estimated returns. However, after 2011 we see that Actual returns are more frequently less than the Estimated return, though grossly innacurate Estimated returns are still the minority.
Let’s see if there are other variables that correlate well with Actual returns as this is what we are after as prospective investors.
Correlation - Borrower Rate vs. Actual Return:
## [1] -0.06
Correlation - Prosper Rating Alpha vs. Actual Return:
## [1] -0.06
Correlation - Prosper Score vs. Actual Return:
## [1] 0.13
Correlation - Employement Status vs. Actual Return:
## [1] -0.01
Correlation - Listing Category vs. Actual Return:
## [1] -0.05
Correlation - Credit Score vs. Actual Return:
## [1] 0.01
Correlation - Monthly Income vs. Actual Return:
## [1] 0.05
Correlation - Monthly Loan Payment vs. Actual Return:
## [1] -0.05
Correlation - Loan Amount vs. Actual Return:
## [1] -0.07
From the above scatter plots and correlation coefficients (note: some factor variables were converted to numeric variables to calculate correlation), we see that none of the selected variables are well correlated with the Actual return a loan provides to an investor. It would appear that estimating a loan’s performance requires careful consideration of multiple variables. Not surprisingly, the Prosper Score is the best correlated (out of the varibales investigated) with Actual returns, which makes some sense as Prosper claims the Score accounts for multiple factors in assessing a loans likelihood of going bad (1 = worst score, 11 = best socre), or inversely not going bad.
From the viewpoint of an investor, I was surprised to see how little correlation there was between the variables I investigated with the Actual Returns of completed loans. Even well established metrics like a Borrower’s credit score, which affect all of our financial purchases/loans were not a good predictor of a loan’s ability to provide good returns. As stated earlier, the art of predicting loan performance is clearly more complicated and is influence by a codination of factors.
I was interested that when reviewing the distributions of Estimated and Actual returns (grouped by loan Term) that although the distributions of Estimated returns might entice investors to lean towards 60 month loans, the Actual return distributions suggested that 12 month loans may in fact be safer. In general, the median Actual returns for 12 and 60 month term loans were comparable (~5%), but the 12 month loans had a noticebaly tighter distribution - suggesting less risk of huge losses but also less opportunity of big returns.
The strongest relationship I found was between the Prosper Score and Actual return, though still quite weak with a correlation of only 0.13.
Let’s see if we can get a better idea of suitable borrowers by pairing Credit Scores grouped by Prosper Ratings and comparing them to Actual Returns.
| Credit Rating | Correlation |
|---|---|
| AA | 0.0851759 |
| A | 0.0145976 |
| B | 0.0547819 |
| C | -0.0587202 |
| D | -0.0141207 |
| E | -0.0181771 |
| HR | -0.1206439 |
Compared to the earlier plot comparing Credit Score to Actual Returns alone (no grouping), this plot provides some useful additional information. We can see that loans with borrowers having a lower Propser Rating (e.g., HR to D) have a much larger range in outcomes for investors, ranging from losses up to 100% to returns greater than 50%! Borrowers with higher ratings (e.g., A and AA) apprear to be less likely to result in losses for investors, but the returns are also much more tempered, topping out around approximately 20%, but more often in the range of 5 - 10%.
The linear models of Credit Score vs. Actual returns for each Prosper Rating are distinctinly different. Interestingly, for the the highest three Prosper Ratings (B - AA), we see positive slopes for the linear fit, but for the lower ratings (HR-C) we see a negative slopes. This suggests that for the the lower Prosper Ratings an increase in the Credit score of the borrower would predict a decrease in loan return for the investor, this seems counter intuitive.
These negative slope lines might also suggest some suspicious discrepancies between Credit Score and Prosper Rating; I would pose the question if it seems likely that a borrower with a Credit Score above 800 (an excellent score) would be assigned a Prosper Rating of HR?
Earlier, we looked at differences in the distribution of Estimated returns and Actual returns, now let’s see if we can identify sub groups (using group_by) where estimated and actual returns are more aligned. This might help us select a type of borrower/loan we might be interested in funding. Each plot will show the trendline for each subgroup relative to a 1:1 line (idealized case where Estimated Return=ActualReturn). I will also include the corresponding Pearson correlation coefficients so we can get a sense of the strength of the linear relationship (if at all) for each group. Recall the groups we used previously:
| Term | correlation |
|---|---|
| 12 | 0.0716802 |
| 36 | -0.0134302 |
| 60 | -0.1510703 |
| cut(BorrowerRate, seq(0, 0.3, 0.1)) | correlation |
|---|---|
| (0,0.1] | 0.0333315 |
| (0.1,0.2] | -0.0789230 |
| (0.2,0.3] | -0.0168653 |
| ProsperRatingAlpha | correlation |
|---|---|
| AA | 0.0003688 |
| A | -0.0957550 |
| B | -0.1801148 |
| C | -0.0829990 |
| D | 0.0264275 |
| E | 0.0440830 |
| HR | -0.0992159 |
| cut(ProsperScore, seq(0, 11, 1)) | correlation |
|---|---|
| (0,1] | -0.2210844 |
| (1,2] | -0.1842815 |
| (2,3] | -0.1486660 |
| (3,4] | -0.0563675 |
| (4,5] | -0.0260201 |
| (5,6] | 0.0401613 |
| (6,7] | 0.0179591 |
| (7,8] | 0.0261794 |
| (8,9] | 0.1263255 |
| (9,10] | 0.1047044 |
| (10,11] | -0.0464021 |
| BorrowerState | correlation |
|---|---|
| AK | 0.0842478 |
| AL | -0.0916366 |
| AR | -0.1001572 |
| AZ | 0.0205976 |
| CA | -0.0301480 |
| CO | 0.0169510 |
| CT | -0.0283825 |
| DC | -0.0567236 |
| DE | -0.0605831 |
| FL | -0.0842696 |
| GA | -0.0438953 |
| HI | 0.0279215 |
| ID | -0.1199919 |
| IL | -0.0579947 |
| IN | -0.0408548 |
| KS | -0.1360404 |
| KY | -0.1155863 |
| LA | 0.0136764 |
| MA | -0.0546082 |
| MD | -0.0352968 |
| MI | -0.0247903 |
| MN | -0.1134257 |
| MO | 0.0081018 |
| MS | -0.0813711 |
| MT | 0.1027679 |
| NC | -0.0778083 |
| NE | -0.1450814 |
| NH | 0.0260131 |
| NJ | -0.0924278 |
| NM | 0.0137732 |
| NV | -0.0669124 |
| NY | -0.0583702 |
| OH | -0.1011674 |
| OK | -0.0718065 |
| OR | -0.0483508 |
| PA | -0.0766041 |
| RI | -0.0150176 |
| SC | -0.0958114 |
| SD | -0.1543569 |
| TN | -0.1114087 |
| TX | -0.0408339 |
| UT | 0.0290206 |
| VA | -0.0782340 |
| VT | -0.0195200 |
| WA | -0.0676498 |
| WI | -0.0758997 |
| WV | -0.0074297 |
| WY | 0.0592308 |
| EmploymentStatus | correlation |
|---|---|
| Employed | -0.0984721 |
| Full-time | 0.0932328 |
| Not employed | -0.1122646 |
| Other | -0.0684336 |
| Part-time | 0.0463235 |
| Retired | -0.0698632 |
| Self-employed | -0.0884390 |
| ListingCategory | correlation |
|---|---|
| 0 | 0.8861940 |
| 1 | -0.0634273 |
| 2 | -0.0630281 |
| 3 | -0.0876882 |
| 5 | -0.0059913 |
| 6 | 0.0241384 |
| 7 | -0.0147582 |
| 8 | -0.1784004 |
| 9 | -0.0786902 |
| 10 | -0.1207947 |
| 11 | -0.2227871 |
| 12 | -0.1837545 |
| 13 | -0.2390733 |
| 14 | -0.2823494 |
| 15 | -0.2007895 |
| 16 | 0.0018699 |
| 17 | 0.1111640 |
| 18 | -0.1953193 |
| 19 | -0.2359011 |
| 20 | -0.1095555 |
| cut(CreditScore, seq(600, 900, 50)) | correlation |
|---|---|
| (600,650] | -0.0525404 |
| (650,700] | 0.0014448 |
| (700,750] | -0.0779940 |
| (750,800] | -0.1597190 |
| (800,850] | -0.1178692 |
| (850,900] | -0.0926533 |
| cut(CreditScore, seq(600, 900, 50)) | correlation |
|---|---|
| (600,650] | -0.0448994 |
| (650,700] | 0.0075433 |
| (700,750] | -0.0645377 |
| (750,800] | -0.1331466 |
| (800,850] | -0.1112680 |
| (850,900] | -0.1207721 |
| cut(MonthlyLoanPayment, seq(0, 1000, 200)) | correlation |
|---|---|
| (0,200] | -0.0475687 |
| (200,400] | -0.0580389 |
| (400,600] | -0.1902531 |
| (600,800] | 0.0502148 |
| (800,1e+03] | 0.0259390 |
From the plots above and their corresponding correlation coefficients (typically less than 0.15), we have further demonstrated the complexity in estimating returns of individual loans. Although the trend line for certian sub-groups closely aligns with the 1:1 line, such as: Borrower Rates between 0 and 0.1, and Borrowers who have listed their employement status as “Full-time”, we can see the correlation between estimated and actual returns for these groups is still very low (0.033, and 0.093 respectively). This suggests that the linear models shown in the plots do not well define the variance in the data. It would seem likely that a multivaritae model might be a better approach for eastimating returns as claerly the linear models evaluated above aren’t doing a very good job.
Interestingly, there is one sub-group that has a trendline entirely above the 1:1 (better actual returns than estimated) and a strong correlation (0.89); Listing Category 0 (“loan purpose not listed”).
kable(subset(df_loan, ListingCategory == 0 ),
caption = "Loans with no Listing Category listed")
| ListingCreationDate | Term | LoanStatus | BorrowerRate | EstimatedReturn | ProsperRatingAlpha | ProsperScore | BorrowerState | EmploymentStatus | ListingCategory | CreditScore | StatedMonthlyIncome | DebtToIncomeRatio | LoanOriginalAmount | MonthlyLoanPayment | ActualReturn | Actual_less_Estimated_Return | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7194 | 2011-08-27 | 36 | Completed | 0.2399 | 0.1243 | D | 5 | OH | Employed | 0 | 710 | 2416.667 | 0.19 | 6350 | 249.10 | 0.2605921 | 0.1362921 |
| 11322 | 2011-01-15 | 36 | Completed | 0.0655 | 0.0400 | AA | 10 | AK | Employed | 0 | 810 | 5166.667 | 0.14 | 3200 | 98.15 | 0.0481750 | 0.0081750 |
| 11365 | 2011-03-13 | 36 | Completed | 0.2199 | 0.1068 | D | 7 | FL | Full-time | 0 | 650 | 3750.000 | 0.08 | 2000 | 76.37 | 0.1666500 | 0.0598500 |
| 12771 | 2011-02-20 | 36 | Completed | 0.2199 | 0.1068 | D | 7 | AL | Other | 0 | 670 | 3378.000 | 0.30 | 4955 | 189.21 | 0.2464521 | 0.1396521 |
| 17310 | 2011-04-13 | 36 | Completed | 0.1656 | 0.1009 | B | 7 | AR | Employed | 0 | 690 | 9166.667 | 0.41 | 15000 | 531.51 | 0.2276827 | 0.1267827 |
| 22945 | 2012-11-02 | 36 | Completed | 0.3177 | 0.1124 | HR | 5 | MA | Employed | 0 | 690 | 5416.667 | 0.20 | 4000 | 173.71 | 0.1746150 | 0.0622150 |
| 25380 | 2011-01-21 | 36 | Completed | 0.2199 | 0.1068 | D | 6 | IL | Employed | 0 | 670 | 6666.667 | 0.23 | 2500 | 95.46 | 0.2072080 | 0.1004080 |
Although this sub-group of loans seems like the best investment possibility found thus far, a closer look shows that only 7 of these loans have been recorded in the dataset (of completed loans), which is far too small asample size to make an informed decision.
Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?
The effect of loan term length seemed failry obvious, with loans having Terms of 12 months appearing to result in better performance relative to estimates when compared to the 36 and 60 month loans, as shown by the postive slope of the trend line. Prosper ratings also seemed tobe indicative of general loan performance, with higher ratings having better performance relative to their estimated returns when compared with lower ratings. This gives us some faith in Prosper’s proprietary metrics. Borrower rates less than 0.10 also seemed to perform well againts estimates in general.
As statetd earlier, the trend line for “full-time” employed borrowers lines up well with the 1:1 line (estimated return = actual return), but surprisingly, the “part-time” borrowers loans often out performed the full-time loans for a given estimated return. I would have guessed that loans from fully employeed borrowers would have had better outcomes in a general sense.
Perhaps niavely of me, I was surpsied at how poor the correlation of estimated and actual returns were, even when splliting up the loans into sub-groups based on other variables in the data set (e.g., credit score). The plots above are a true demonstration in the complexity of identify loan types where the estimated returns by Propser are truly indicative of the actual returns experienced by investors.
I selected the first plot because of it’s usefulness in guiding the rest of my analysis. Without having plot a time series of the loan creation dates, I would have never realised that that the Prosper data set itself had structural changes in how data was recorded. This brought me to the decision of removing all loans that occurred before the 2008 SEC regulation of Prosper’s platform, which I believe improved the quality of the analysis. Otherwise I would have been comparing apples to oranges.
| Term | mean_ActualReturn | median_ActualReturn |
|---|---|---|
| 12 | 0.0336337 | 0.0581120 |
| 36 | 0.0161184 | 0.1225870 |
| 60 | -0.1683072 | 0.0510413 |
I chose the second plot(s) because they provided the initial insight between estimated loan returns and actual returns. As stated earlier, the distribution of actual returns covered a much larger range than is estimated by Prosper.Some loans even resulte in major losses for investors. These plots also helped get an idea of the relative levels of risk between loans of different Term lengths, with 12-month loans appearing to be the least risky (less negative outcomes, but also modest returns) and 36-month loans appearing to have the most risk (losses up to 100% but also some returns greater than 50%). This discovery of the difference in estimated and actual returns led me to dig deeper to see if there were a group or type of loan where estimated and actual returns were similar, which would have provided a lot of value information to me as an investor.
| Borrower Rate | Correlation |
|---|---|
| (0,0.1] | 0.0333315 |
| (0.1,0.2] | -0.0789230 |
| (0.2,0.3] | -0.0168653 |
I seleceted the final plot above as it provides three take-aways:
As shown above, most of the loans (for all borrower rates) do in fact make positive returns, but betting on any single loan could have a massive range of outcomes (rags-to-riches and vice versa). Therefore it seems prudent to diversify investments across many different types of loans.
Analysis of loan data from 113,937 Prosper loans was performed. Initial investiagtions reduced the data set size to 84,853 loans due to changes in data structure after 2008 SEC regulations were imposed on the Prosper p2p lending marketplace. The data set was further reduced in size to 26005 loans so that only completed loans were analyzed. The distribution for multiple vairabels were reviewed and discussed.
Some of ths struggles I encountered were in trying to make an interactive chloropleth plot using leaflet and learning how to handle and interact with geospatial data in R (e.g., shape files, SpatialPolygonDataFrame). It turns out this plot wasn’t terribly useful from an investment standpoint but was a valuble learning experience.
The biggest struggle I had was in finding a variable that was well correlated with the actual return of a loan. Unsurprisingly, there are many factors to consider when evaluating a loan as a potential investement and trying to distill an investment strategy to a single input variable is essentially impossible.
Further analysis could be made to take a multivariate approach to fitting a model that better describes actual loan performance and the wide variety of outcomes that may befall any investor for any given loan based on the variables provided by Prosper.